arXiv:1501.06350v3 [cs.DS] 6 May 2015 


D-Iteration: diffusion approach for solving 

PageRank 


Dohy Hong^, The Dang Huynh^, and Fabien Mathieu^ 
^ Samsung 

^ Alcatel-Lucent Bell Labs France 


Abstract. In this paper we present a new method that can acceler¬ 
ate the computation of the PageRank importance vector. Our method, 
called D-Iteration (DI), is based on the decomposition of the matrix- 
vector product that can be seen as a fluid diffusion model and is po¬ 
tentially adapted to asynchronous implementation. We give theoretical 
results about the convergence of our algorithm and we show through ex¬ 
perimentations on a real Web graph that DI can improve the computa¬ 
tion efficiency compared to other classical algorithm like Power Iteration, 
Gauss-Seidel or OPIC . 


1 Introduction 

PageRank is a link analysis algorithm that has been initially introduced in m 
and used by the Google Internet search engine. It assigns a numerical value to 
each element of a hyper-linked set of nodes, such as the World Wide Web. The 
algorithm may be applied to any collection of entities (nodes) that are linked 
through directional relationships. The numerical value assigned to each node 
called PageRank, is supposed to reflect the structural importance of nodes. 

Although PageRank may today only be a small part of Google’s ranking 
algorithm (the complete algorithm is obviously kept secret, but it seems to take 
into account hundreds of parameters, most of them been related to the user’s 
profile), it stays appealing, especially in the research community, as it balances 
simplicity of definition, ranking efficiency and computational challenges. 

Among these challenges are the growing size of the dataset (Web graphs can 
have tens of billions of nodes) and the dynamics ot the structure that requires 
frequent updates. 

The complexity of computing the PageRank of a graph rapidly increases with 
the number of nodes, as it is equivalent to computing an eigenvector on some 
huge, matrix, and efficient and accurate methods to compute eigenvalues and 
eigenvectors of arbitrary matrices are in general a difficult problem. In the par¬ 
ticular case of the PageRank equation, several specific solutions were proposed 
and analysed mm including the power method m with adaptation [in] or 
extrapolation ma, or the adaptive on-line method [T], etc. 

In this paper we present some theoretical results of D-Iteration (DI): mathe¬ 
matical definition, convergence to the fixed point, measurement of the distance to 




the limit and solution for dynamic graph update issues. The results are validated 
on a real dataset. 

The rest of the paper is organized as follows. Section [5] recalls the PageRank 
equation and exposes some of the main methods used to solve it. Section [3] 
formally defines the D-Iteration method and provides a few theoretical results 
on its convergence and use in a dynamic context. Lastly, Section 0] proposes a 
numerical evaluation of DFs performance and Section [S] concludes. 

2 The PageRank challenge 

In this section, we shortly introduce the definition of PageRank along with some 
standard algorithms to compute it. For more detailed information, one may for 
example see m- 

2.1 Definition 

The informal definition of PageRank is rather simple: it is an importance vector 
over Web pages such that important pages are referenced by important pages [B] . 

More formally, let G = {V, E) be a weighted, oriented, graph. The size of 
G is n = \V\ and Wij > 0 is the weight of edge (z,j) £ E. G represents a set 
of nodes their (weighted, oriented) relationships. In [B], G was a Web graph, 
V representing web pages and E hyperlinks, but the principle applies to most 
structured sets. 

Let P be a n X n diffusion matrix defined by: 



( 1 ) 


P is a left substochastic matrix, column i summing to 1 if node i has an 
outgoing edge, 0 if f is a dangling node. Note that: 

— For unweighted graphs, the expression of P is simpler: for {j, i) G P, we just 
have Pi^j = l/out(j), where out(j) is the out-degree of j. 

— Some variants of PageRank require P to be stochastic. For these variants, 
one usually pads the null columns of P with i {dangling nodes completion). 

P represents how importance flows from node to node. When it is stochastic, 
it represents the Markov chain over V implied by the edges E. In that case, the 
PageRank can be defined as a stationary state of the Markov chain, that is a 
distribution x over V that verifies 


X = Px. 


( 2 ) 


Note that x is unique if G is strongly connected. 

In practice, the following variant is used instead of (I2|): 


X = dPx -I- (I — d)Z, 


(3) 



where 0 < d < 1 is called damping (often set to 0.85), and Z a default distribution 
over V (often set to the uniform distribution). 

If P is stochastic, the solution a: of ([3]) is a distribution, which corresponds to 
the Markov chain defined by: with probability d, use the Markov chain induced 
by P] otherwise jump to a random node according to distribution Z. 
Introducing parameters d and Z has many advantages: 

— It guarantees the existence and uniqueness of a solution for any substochastic 
matrix P, without any assumption on G. 

— It speeds up the PageRank computation (cf below). 

— Parameter d introduces some locality: influence of a node at distance k is re¬ 
duced by a factor d*. This strengthens the impact of the local structure and 
mitigates the possibility of malicious PageRank alterations through tech¬ 
niques like links farm [2]. 

— Parameter Z allows to customize the PageRank. For instance, one can con¬ 
centrate the default importance on pages known to talk about some given 
topic to create a topic-sensitive PageRank]^. 

In the rest of the paper, unless stated otherwise, we focus on solving ([3]). 
Writing the solution is straightforward: 

a: = (1 — d){I — dP)~^Z, where I is the identity matrix. (4) 

However, such a direct approach cannot be used due to the size of P that 
forbids an explicit computation of (/ — dP)~^. Instead, one can use different 
iterative methods (cf [T3]l. 

2.2 Po-wer Iteration 

The simplest approach is Power Iteration (PI): starting from an initial guess 
vector cco, the stationary PageRank vector is iteratively computed using ©: 

Xk+i = dPxk + (1 - d)Z, (5) 

until the change between two consecutive vectors is negligible. It is straightfor¬ 
ward that the error decays by a factor at least d at each iteration (hence one 
of the interests of introducing a damping factor). PI requires to maintain two 
vectors Xk and Xk+i- 

2.3 Gauss-Seidel 

The Gauss-Seidel (GS) applied to PageRank consists in using the updated 
entries of Xk as they are computed. During an iteration round, entries Xk+i{i) 
are computed from i = 1 to i = n using: 


Xk+iii) = d '^PijXk+iij) + '^PijXk{j) + (1 - d)Z{i). (6) 


Thanks to the immediate update, one needs to maintain only one vector 
and the convergence is faster, typically by a factor 2 asymptotically. The main 
downside of the update mechanism is the necessity to access the entries in a 
round-robin fashion, which can cause problems in a distributed scenario. Note 
that Gauss-Seidel belongs to a larger class of methods called Successive Over¬ 
relaxation (SOR), but other SOR variants are seldom used for Web PageRank 
computations [7] . 


2.4 Online Page Importance Computation 


An algorithm very close to ours is the Online Page Importance Computa¬ 
tion (OPIC) proposed in [I]. Its core idea: most PageRank algorithms implicitly 
use a pull approach, where the state of a node is updated according to the states 
of its incoming neighbors. By contrast, OPIC proposes a push approach, where 
the state of a node is read and used to update the states of its outgoing neighbors. 
In details, OPIC focuses on solving ([5]) for a modified graph G' = {VUz,EL)J), 
where z is a virtual zap node and J = (P x z) U ( 2 ; x P) is all possible edges 
between P and z. This was introduced to make P stochastic and irreducible, 
allowing ([5]) to admit a unique solution. 

OPIC algorithm works as follows: initially, each node receives some amount 
of fluid (a non-negative number) and a null history record. A scheduler, which 
can be associated to a web crawler, iterates among the nodes. When a node i is 
selected, its fluid F{i) is, in order. 


— credited to its history: H{i) = H{i) + F{i)-, 

~ equally pushed to its neighbors: for all j that are outgoing neighbors of i, 
F{j) = F{j) ^ 

— cleared: F{i) = 


OUt(2 

FP) 

ou 


^ if i has a loop, F{i) = 0 otherwise. 


It has been shown that as long as the scheduler is fair (i.e. any given node is 
selected infinitely often) then the history vector converges, up to normalization, 
to the desired solution [I]. 

The main advantage of OPIC is its flexibility. In particular, it is easy to adapt 
and incorporate to a continuous, possibly distributed, Web crawler, allowing to 
get a dynamic, lightweight, PageRank importance estimation. One drawback is 
that it is not designed to work with (jS]). 


2.5 Other Variants 

Due to lack of space, we only briefly describe a few other methods that have 
been proposed. 

The Generalized Minimal Residual (GMRES) [IS] provides an approach using 
Arnoldi process to find the stationary vector in Krylov subspace. It is claimed 
to perform better than other algorithms in terms of iterations, but it requires 
more elementary computation steps per iteration. 




Quadratic Extrapolation [T^] uses estimates of secondary eigenvectors of P 
to speed up the convergence of PI. 

Lastly, a few methods have been proposed that take advantage of the locality 
of hyperlinks (most hyperlinks are internal to Web sites) to decompose PageRank 
computation into intra-site and extra-site values nmi?]. These methods are 
especially adapted to distributed computation, but they usually only output an 
approximation of the solution of ([3]). 

3 D-Iteration Algorithm 

Our proposal, D-Iteration, aims at solving Equation ([3]) with an efficiency similar 
to Gauss-Seidel while keeping the scheduling flexibility offered by OPIC. This 
result in a fluid diffusion approach similar to OPIC with some damping added 
to the mix. 


3.1 Definition 

In the following, we assume that a deterministic or a random diffusion sequence 
X = {ii, 12 ,..., ik, ...} with ik € {1,.., n} is given. We only require that the number 
of occurrences of each value of ik in X is infinite (the scheduler X is fair). X is 
not obliged to be fixed in advance but can be adjusted on the fly as long as the 
fairness property stands. 

Like with OPIC, we have to deal with two variable vectors at the same time: 
a fluid vector F, initially equal to (I — d)Z and a history vector H, initially 
null for all nodes. When a node is selected, its current fluid value is added to its 
history, then a fraction d of its fluid is equally pushed to its neighbors and its 
fluid value is cleared. 

Eormally, the fluid vector F associated to the scheduler X is iteratively up¬ 
dated using: 


Fo = (1 - d)Z, (7) 

Pk — Pk—l F dFk—l(ik)PCif^ Fk—l(ikflikj (8) 

where is the standard basis vector corresponding to ik- 

The second term in ([5]) represents the damped diffusion and the third term 
clears the local fluid (up to loops). Similarly to OPIC, an iteration reads one 
value, Fk-i(ik) and updates ik and its outgoing neighbors. Note that P is always 
non-negative. 

We also formally define the history vector iJ: 


iLo = 0, 

Fk — Hk — \ F Fk-liik^^ik- 


By construction, Fdk is non-decreasing with k. 


(9) 

( 10 ) 




3.2 Convergence 


The following Theorem states the convergence of the D-Iteration algorithm; 

Theorem 1. For any fair sequence T, the history vector Fl^ converges to the 
unique vector x such that x = dPx + {I — d)Z: 

lim iJfe = (1 - d){I - dP)-^Z. 

k—^co 


Moreover, we have 


\Fk\ 


\x — F[k\ < where | ■ | is the Li norm. 


Proof. We first prove the equality: 

Hk F Fk = Fq F dP Flk • 


( 11 ) 


( 12 ) 


This is straightforward by induction: m is true for k = 0; assuming it is 
true for k — 1, we have 


Fdk T Fk — Flk — l F Fk — xifkfj^ik F Fk — i F dFk — ii^ik^Pf^i,,, Fk—\{ik)^ii, 

= Fq + dP{F[k-i F Fk-i{ik)&ik) = ^0 + dPFlk. 

From the equation [T2j we have: 

Hk = {I - dP)-\FQ - Fk) 

= x- EZo d^F^Fk 

Noticing that P is substochastic, we get 


oo oo IP I 

\x-Hk\ = \J2d^P^Fk\<Y.^'\Fk\ = j^. 

z—0 z—0 

All we need is to show that \Fk \ tends to 0. Notice that the total available fluid 
is non-increasing. That being said, the fluid of a given node is non-decreasing 
until it is scheduled, and when it is, a quantity (1 — d) of it is “lost” due to 
the damping (or more if it is a dangling node). Given these two observations, 
let us consider a time k and another time k' > k such that all nodes have been 
scheduled at least once between k and k' (this is always feasible thanks to the 
fairness assumption). For each node, its fluid at the time of its first scheduling 
after k is greater than its fluid at time k, so we have \Fk'\ < d\Fk\. This is true 
for any k (including k') and concludes the proof. 




3.3 Implicit completion 

Equation dm is an equality if P is stochastic (for example thanks to dangling 
node completion), in which case we have \P^Fk\ = |Efc|. 

One may want to solve dD for a completed P. However, if the completion 
follows the distribution Z, one observes that the result is just proportional to 
the one without completion, so this is just a matter of normalization. 

It is more efficient to perform the computation on a non-completed matrix 
(every time a dangling node is selected, all non-null entries of Z are updated if 
P is completed). The question is: can we control the convergence to the solution 
of the completed matrix while running the algorithm on the original one? 

To address this problem precisely, we count the total amount of fluid that 
has left the system when a diffusion was applied on a dangling node. We call 
this quantity Ik (at step k of the DI). To compensate this loss and emulate 
completion, a quantity dlkZ should have been added to the initial fluid, leading 
to (1 — d -b dlk)Z instead of (1 — d)Z. But then the fluid dlkZ would have 
produced after k steps a leak (dZ^/(l — d))Z on dangling nodes, which needs to 
be compensated... 

In the end, the correction that is required to compensate the effect of dangling 
nodes on the residual fluid |idc| consists in replacing the initial condition |Tb| = 
(1 — d) by |Fq| such that: 



{1-dr 


1 — d — dlk 


As |Eq|/|Fo| = (1 — d)/(l — d — dlk), Hk needs to be renormalized (multipli¬ 
cation) by (1 — d)/(l — d — dlk) so that the exact Li distance between x and the 
normalized H is equal to: 


1 — d — dlk 1 — d — dlk 


To summarize, we can run the algorithm on the original matrix using 
as a stopping condition that guarantees the precision of the normalized result. 

3.4 Schedulers 

The actual performance of DI is directly related to the scheduler used. A simple 
scheduler, which we call Dl-cyc, is a Round-Robin (RR) one, where a given 
permutation of nodes is repeated as long as necessary. 

Theorem 2. For any Round-Robin scheduler X, we have: 


\x - F[k\ < dL"J. 


( 13 ) 









The proof is a direct application of the proof of Theorem [T] considering suc¬ 
cessive sequences of n steps. 

Theorem [5] ensures that D-Iteration performs at least as well as the PI 
method: in both cases, after a round where all nodes have been processed once, 
the error id reduced by at least d. 

While the bound can be tight for specific situations (for instance a clockwise 
scheduler applied to a counterclockwise-oriented cycle), it is conservative in the 
sense that it ignores that some of the fluid can be pushed multiple times during 
a sequence of n steps. For that reason, and keeping in mind that D-Iteration is 
a push version of the Gauss-Seidel method, we expect that Dl-cyc will perform 
more like Gauss-Seidel in practice. 

However, one strength of D-Iteration is the freedom in the choice of a sched¬ 
uler. As the convergence is controlled by the remaining fluid |Ffc|, a good sched¬ 
uler is one that makes the fluid vanish as quickly as possible. This disappear¬ 
ance is caused by two main parameters: the damping factor d and dangling 
nodes (cf above). Reminding that at time k a quantity (I — d)Fk-i{ik) vanishes 
through damping, a natural greedy strategy would consist in selecting at each 
step ik = argmax^^i ^„F’fc_i(z). 

The main drawback of such a strategy is the expensive searching cost. There¬ 
fore we propose a simple heuristic called Dl-argmax as follows: we use a RR 
scheduler, but at each iteration, we run through the scheduler until we find a 
node that possesses a fluid greater or equal to the average fluid in the whole 
graph. The advantage of this method is that nodes with relatively low fluids will 
be skipped, avoiding unprofitable updates operation, with a searching cost lower 
than picking up the best node at each iteration. 

Theorem 3. Using Dl-argmax, we have: 



(14) 


The proof is immediate, as by construction we have \Fk\ < (1 — i^)|Ffe-i|. 
Note that the Theorem proves the convergence of Dl-argmax, which is not 
a fair scheduler: for instance, after some time, it will ignore the transient nodes 
of the graph, which eventually have no fluid left. We conjecture that (ITHI is not 
tight (tightness would require to always choose an average node) and that the 
actual convergence should be faster. 

3.5 Update equation 

The existing iterative methods (such as Gauss-Seidel iteration. Power Iteration, 
...) can naturally adapt the iteration when G (and thus P) is changed because 
they are in general independent of the initial condition (for any starting vector, 
the iterative scheme converges to the unique limit). The simplest way to adjust 
is to compute the PageRank of the new graph using the previous computation 
as starting vector. 



This technique cannot be used in the case of DI so we need to provide an 
adapted result. 

Theorem 4. Assume that after diffusions, the DI algorithm has computed the 
values {Hko,Fk„) for some matrix P, and consider a new matrix P' (typically 
an update of P). One can compute the unique solution of the equation x' = 
dP'x' + {1 — d)Z hy running a new D-Iteration with starting parameters 


K = Fko + d{P' - P)Hu„ 
Ho = Hko- 


(15) 

(16) 


A few remarks on Theorem 2) 

— It implies that one can continue the diffusion process when P is regularly 
updated: we just need to inject in the system a fluid quantity equal to d{P' — 
P)Hka and then change to the new matrix P', keeping the same history. 

— The precision of the result directly relates to the quantity of fluid left. Here 
the precision induced by + d{P' — P)Hka seems rather minimal, as the 
original fluid is only altered by the difference between the two matrices. In 
particular, if the difference P — P' is small, the update should be quickly 
absorbed. 

— For sake of clarity, we assumed that the set of nodes is the same between P 
and P', but the result can be extended to cope with variations of V. 

— One can notice that this update may introduce negative fluids (this is indeed 
mandatory to correct values that are higher in Hkg than in x'), but this has 
no practical impact on computation. 

Proof. Call Hfg the asymptotic result of the new D-Iteration. We first use m 
on the reduced history H'f. — Hkg (Equation (fT^ requires that the history is 
initially empty): 


(H', - Hkg) + F' = F' + dP'{H', - Hkg). 


Letting fc go to oo leads to 


{H(, - Hkg) =F(+ dP'{H'^ - Hkg) 


= Fkg + d{P' - P)Hkg + dP'{H'^ - Hkg) 
= dP'H'^ + Fkg - dPHkg, 


which can be written 


H'^ = dP'H'^ + Hkg + Fkg - dPHkg. 
Equation (IT^ [Hkg + Fkg = Fq + dPHkg.) concludes the proof. 


4 Numerical evaluation 


In this section, we compare the convergence speed of DI with other algorithms 
exposed in Section O Power Iteration, Gauss-Seidel, and OPIC. 

The results shown in this paper are based on the uk-06-07 Web graph, which 
is compressed using techniques in 0)11] and available at [Hj. This is a crawl done 
for DELIS project [5], a one-year snapshot (from May 2006 to May 2007) of the 
• uk domain. It contains 133 million nodes and 5.5 billion links. 


4.1 Settings 

All the algorithms are tuned to solve Equation ([3]) using d = 0.85 and Z = ^. 

In the case of OPIC, remind that the original version does not take into 
account of damping factor. In order be consistent with other algorithms, we 
emulate ([3]) by running OPIC on the stochastic matrix P' defined by: 

^ a dangling node, (17) 

I otherwise. 

n 

Note that this emulation makes each diffusion rather costly as all entries need 
to be updated. It is only introduced to allow comparison with other methods, 
assuming all diffusions have the same cost, and should not be used in practice. 

For DI, we used the two exposed variants, Dl-cyc and Dl-argmax. The same 
schedulers were used for OPIC, called OPIC-cyc and DPIC-argmax. Remember 
that the fluid amount F is constant in OPIC, so the threshold that triggers 
diffusion in DPIC-argmax is constant (it is the average fluid). 

4.2 Results 

The results are shown in Figure [T] 

The x-axis counts the number of rounds required used by the algorithms. For 
pull methods like PI and CS, a round corresponds to updating the n entries. For 
push methods like DI and OPIC, a round corresponds to n elementary diffusion 
steps. 

The y-axis indicates the precision of the current vector xi at round com¬ 
pared to the real PageRank vector x (the distance to the limit or \x — xi\). With 
DI, the precision can be directly deduced from |F„/(I — d — de„)|. For other 
methods, a 10“® approximation of x was precomputed using DI. 

The OPIC methods perform well during the first few rounds, with a clear 
advantage of DPIC-argmax over DPIC-cyc. However, the convergence slows down 
really hard after. Note that OPIC remains interesting as its primary goal was 
to provide lightweight PageRank estimates. The results only state that OPIC 
should not be used for precise PageRank estimations 

Power Iteration has asymptotically the smallest convergence speed after OPIC, 
but a good starting performance. As a result, it remains a good candidate for 




Fig. 1: Convergence to the stationary PageRank vector 


rough estimates (up to 10“^), but other methods should be used if one needs to 
be more accurate. 

The Gauss-Seidel method has the second best convergence speed, although it 
does not benefit from the quick start observed with PI. It still is a good candidate 
if one needs a simple and efficient way to compute a precise PageRank. 

Dl-cyc performs very similarly to Gauss-Seidel (although a little slower). 
This is in line with our interpretation that Dl-cyc is a kind of push version of 
Gauss-Seidel. 

Lastly, Dl-argmax clearly outperforms the other methods, being in par with 
DPIC-argmax after two rounds and achieving in 7 rounds the precision reached 
by Gauss-Seidel in 20 rounds. This shows that Dl-argmax is a method of choice 
for all intended precisions, and suggests that Equation (d really underestimates 
its actual performance. 

5 Conclusion 

In this paper, we proposed an algorithm based on a diffusion approach, to solve 
the PageRank equation. We demonstrated some theoretical results concerning 
the correctness (convergence), the precision measurement and update equations. 
Our algorithm shows its potential through experiments on real data in compar¬ 
ison with other classical pull (Power Iteration, Gauss-Seidel) and push (OPIG) 
methods. For future works, we plan to focus on refining the theoretical conver¬ 
gence results and on how to adapt and implement DI in a distributed manner. 
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